Amazon S3 Service Disruption
Learn about the Amazon S3 service disruption and the possible failure mitigation techniques.
Introduction#
Amazon Simple Storage Service (S3) is one of the services AWS offers. S3 is a highly secure, scalable, and durable object storage service that provides data storage and retrieval from anywhere.
On February 28, 2017, S3 started to fail in the Northern Virginia (US-EAST-1) region due to a human error. This service disruption lasted several hours, affecting many of its customers, including Slack, Netflix, and Reddit.
In this lesson, we discuss the root cause of the S3 failure and how to mitigate such failures.
How did it happen?#
The root cause of the S3 outage was a human error made during a routine debugging process. Let's look at how this happened:
An Amazon S3 team member attempted to troubleshoot an issue with the billing system. The intention was to remove a small number of servers from one of the S3 subsystems used by the billing system.
One of the inputs to the command was entered incorrectly, which removed a much larger set of S3 servers than intended.
The removal of servers caused two other dependent S3 subsystems, the index and placement subsystems, to fail. The index subsystem manages the S3 objects' metadata and location information and serves GET, LIST, PUT, and DELETE requests. The placement subsystem is responsible for allocating storage for new objects. As a result of the removal, these two subsystems lost a notable amount of their capacity and required a full restart.
The outage affected not only the S3 service itself but also other AWS services, such as Amazon Elastic Compute Cloud (EC2), AWS Lambda, and Amazon CloudFront, which rely on S3 for data storage and retrieval. The impact of the outage varied among customers and applications, but many experienced slow performance, errors, or complete downtime.
The AWS Service Health Dashboard (SHD) also failed due to the SHD administration console's dependency on Amazon S3.
Analysis and takeaways#
From a simple command to a catastrophe: The event was triggered by a single incorrectly entered input to a routine command. Validating command inputs at runtime is challenging; however, monitoring and automation tools that analyze a command and estimate its impact before execution might have alleviated the issue.
Recurrence of the event: To avoid a recurrence of a similar event in the future, some measures can be adopted, such as slowing down the rate at which capacity can be removed. Another is incorporating safeguards that prevent capacity from being removed when doing so would cause a subsystem to fall below its minimum required capacity level.
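Such a safeguard can be sketched as a pre-removal check. This is a minimal illustration, assuming each subsystem declares a minimum required capacity; the subsystem names and figures below are hypothetical, not AWS's actual values:

```python
# Hypothetical minimum-capacity floors per subsystem (illustrative values).
MIN_REQUIRED_CAPACITY = {"index": 50, "placement": 30}


def remove_capacity(subsystem: str, current: int, to_remove: int) -> int:
    """Refuse any removal that would drop a subsystem below its floor."""
    floor = MIN_REQUIRED_CAPACITY.get(subsystem, 0)
    remaining = current - to_remove
    if remaining < floor:
        raise ValueError(
            f"Refusing to remove {to_remove} servers from '{subsystem}': "
            f"{remaining} would fall below the minimum of {floor}"
        )
    return remaining
```

With a check like this in the removal tooling, a mistyped input removes at most the capacity the subsystem can safely lose, rather than enough to force a full restart.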
Slow restoration: It might seem odd that restoring the service took around four hours. Detecting and terminating the erroneous command quickly would have limited its cascading effect and the blast radius of the failure.
SHD failure: AWS SHD provides visibility to the users. Since SHD has a dependency on S3, it should be deployed across multiple AWS regions so that if one region goes down, another can keep users informed.
Mitigation techniques to employ#
The failure mitigation strategies that AWS either adopted or should have adopted include:
Service health checks: AWS monitored the health of its services and implemented service health checks to prevent requests from being routed to unhealthy services, reducing the impact of the outage on customers.
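A health check can be sketched as a probe that filters out unhealthy targets before requests are routed to them. The callable-probe interface below is an assumption for illustration; real systems typically poll HTTP health endpoints:

```python
def healthy_targets(targets: dict) -> list:
    """Return only the targets whose health probe succeeds.

    `targets` maps a target name to a zero-argument probe callable
    (an assumed interface); a probe that returns False or raises
    marks its target unhealthy.
    """
    alive = []
    for name, probe in targets.items():
        try:
            if probe():
                alive.append(name)
        except Exception:
            pass  # a failing probe marks the target unhealthy
    return alive
```

A load balancer would then route requests only to the names this returns, so traffic never reaches a target that is known to be down.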
Circuit breakers: AWS used circuit breakers to detect and isolate service failures and prevent them from cascading across the infrastructure, minimizing the impact of the outage on other services.
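A minimal circuit breaker can be sketched as follows. This is a toy illustration, not AWS's implementation; the failure threshold and reset window are assumed parameters:

```python
import time


class CircuitBreaker:
    """After `threshold` consecutive failures, 'open' the circuit and
    fail fast for `reset_after` seconds instead of calling downstream."""

    def __init__(self, threshold: int = 3, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: the window elapsed, so allow a trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

Failing fast while the circuit is open is what stops a struggling dependency from dragging down every caller that keeps retrying it.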
Retry and backoff: AWS implemented retry and backoff mechanisms for failed API requests, reducing the number of failed requests and improving the overall availability of services.
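Retry with exponential backoff can be sketched as below, here with full jitter (a common variant that randomizes the wait to avoid synchronized retry storms). The parameter values are illustrative, not AWS SDK defaults:

```python
import random
import time


def retry_with_backoff(fn, retries: int = 4, base: float = 0.1, cap: float = 2.0):
    """Call `fn`, retrying on any exception with exponentially growing,
    jittered sleeps; re-raise if the final attempt also fails."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            # Full jitter: sleep a random time up to the exponential bound.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The backoff matters as much as the retry: without it, every failed client retries immediately and in lockstep, amplifying load on a service that is already struggling to recover.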
Scaling up resources: AWS scaled up resources in unaffected regions to handle the increased demand for services and mitigate the impact of the outage.
Risk management and error checks: Human error, as in the Amazon S3 service disruption, can be reduced by adopting a series of review checks in which engineering managers properly analyze a command before it is executed.
Resuming the system: Complex systems tend to operate in a stable state, and any deviation from that state requires careful attention and management to restore them to their steady state. S3's restoration took so long because the service had grown enormously and there were billions of objects to account for during the restart. So, when a service like Amazon S3 suddenly experiences a drastic decrease in capacity and requires a reboot, bringing it back up to its previous operational level takes much longer than expected. Careful measures need to be adopted to make such systems resilient to such extreme events.
These mitigation strategies were crucial in alleviating the impact of the S3 outage on AWS customers. They continue to be a key component of AWS's approach to providing highly available and resilient services.